Weighted p-norm distance t kernel SVM classification algorithm based on improved polarization

The kernel function in SVM enables linear separation in a feature space for a large amount of linearly inseparable data. The kernel function that is selected directly affects the classification performance of SVM. To improve the applicability and classification prediction effect of SVM in different areas, in this paper, we propose a weighted p-norm distance t kernel SVM classification algorithm based on improved polarization. A t-class kernel function is constructed according to the t distribution probability density function, and its theoretical proof is presented. To find a suitable mapping space, the t-class kernel function is extended to the p-norm distance kernel. The training samples are obtained by stratified sampling, and the affinity matrix is redefined. The improved local kernel polarization is established to obtain the optimal kernel weights and kernel parameters so that different kernel functions can be combined by weighting. The cumulative optimal performance rate is constructed to evaluate the overall classification performance of different kernel SVM algorithms, and the significance of the effects of different p-norms on the classification performance of SVM is verified by statistical comparison tests based on 10 repetitions of fivefold cross-validation. The results on 6 real datasets show that, in most cases, the proposed weighted p-norm distance t kernel improves the classification prediction performance of SVM compared with the traditional kernel functions.

In the 1990s, Vapnik systematically introduced statistical learning theory and proposed the SVM algorithm 1 . Due to its excellent performance in the field of text mining 2 and fault diagnosis 3 , SVM gradually became the mainstream technology of machine learning methods and directly promoted the climax of statistical learning development. The study of the kernel method was officially initiated based on the great success of SVM, and SVM promoted the rapid popularization and application of the kernel method. The kernel method has gradually expanded into many fields of machine learning, such as pattern recognition 4 , feature selection 5 , and deep learning 6,7 . The kernel function directly determines the performance of the SVM classification algorithm 8 and various kernel methods because a proper kernel function can map samples to an appropriate feature space. In an appropriate feature space, similar samples are close together and different samples are far apart. A kernel function is introduced to greatly improve the accuracy, recognition rate, and dimension reduction efficiency of machine learning algorithms.
Subsequently, many methods based on the kernel technique have been proposed. Schölkopf et al. 9 proposed a kernel trick so that principal component analysis could be utilized as a nonlinear dimension reduction technique. As a result, nonlinear mapping from high-dimensional space to low-dimensional space can be achieved and the performance of the learner is improved. Mika 10 introduced the kernel function into linear discriminant analysis (LDA); the resulting method is known as KLDA. KLDA can address the nonlinear data analysis problem and can achieve higher accuracy than LDA. Si proposed a new and improved kernel partial least squares method to address nonlinear characteristics in industrial processes 11 . Some kernel functions have been proposed for specific fields. For example, Huma et al. 12 proposed the application of a string kernel in natural language processing to improve the efficiency of text classification. Bernhard et al. 13 studied the application of kernel methods in the field of bioinformatics.
Scientific Reports | (2022) 12:6197 | https://doi.org/10.1038/s41598-022-09766-w www.nature.com/scientificreports/
The above methods are based on only a single kernel. Because different kernel functions have different characteristics, the performance of kernel functions varies greatly across application scenarios. When the sample size is large, the multidimensional data are irregular, or the data are not evenly distributed in the feature space, it is not reasonable to map the training set directly with a single kernel 14,15 . To improve the flexibility and applicability of the kernel function, multiple kernel functions are combined, i.e., multiple kernel learning. Multiple kernel learning has been a long-standing, well-known and practical research direction in machine learning. Gönen 16 provided a taxonomy and review of several multiple kernel learning (MKL) algorithms. They concluded that multiple kernel learning is useful in practice and that a better MKL algorithm could be devised for improved accuracy and decreased complexity and training time. In recent years, many multiple kernel methods have been proposed to solve specific problems. Rakotomamonjy 17 proposed a simple MKL algorithm. In the weighted 2-norm regularization form, an additional 1-norm constraint is applied to the multikernel weight coefficients, which provides a new idea for multiple kernel learning based on mixed norm regularization. Fan 18 proposed a multiple random empirical kernel learning machine (MREKLM), which adopts the random projection idea to map samples into multiple low-dimensional empirical feature spaces with lower computational complexity. Li 19 proposed the multiple kernel learning support vector machine particle swarm optimization model to identify pulmonary nodules and obtained better recognition efficiency. Gao 20 proposed a multiple kernel learning method with the Mahalanobis distance to classify hyperspectral images.
Based on the linear weighted combination of the Mahalanobis basic kernel, the hyperspectral data are mapped to a feature space with a smaller intraclass distance and larger interclass distance, and then they are classified to improve the prediction accuracy. Wang 21 proposed a new model parameter selection method for support vector machines based on adaptive fusion of multiple kernel functions and realized adaptive selection of the multiple kernel function weighted coefficient, kernel parameters and regression parameters. Ergul 22 proposed a multiple composite kernel extreme learning machine for hyperspectral images, and the obtained results were compared with those of state-of-the-art standard machine learning methods. The multiple kernel model has better applicability and flexibility than the single kernel model. The above works have shown that the interpretability of the decision function can be enhanced and the performance of the learner can be boosted by using multiple kernels instead of a single kernel. In the multiple kernel framework, a convex combination of several single kernels is used, and the key to multiple kernel learning is the selection of the basic kernels and the calculation of the weight coefficients. We can use an existing kernel as the basic kernel or create a new kernel according to kernel construction theory to use as the basic kernel 23 . There are two main ways to calculate the weight coefficients: heuristic algorithms 24 and optimization models. The former needs to be associated with the performance of subsequent classifiers, so it is time-consuming, while the latter has a rigorous theory and lower computational complexity. Examples of typical optimization models are described as follows. Lanckriet 25 obtained the weighted kernel matrix from data based on a semidefinite programming idea and solved for the optimal weight coefficients.
Sonnenburg 26 rewrote the convex quadratically constrained quadratic program in reference 25 as a semi-infinite linear programming problem to solve for the kernel weights. The gradient descent method has also been adopted by some researchers to optimize the weights 27,28 .
Obviously, the multiple kernel model consists of several basic single kernels. The expression of the single kernel function often determines the multiple kernel performance. Single kernel functions have the advantage of simple expression and fewer parameters over multiple kernel functions and can solve specific domain problems. Their deficiency lies in the fixed expression form, which results in poor universality. To solve this problem, a more flexible multiscale kernel was introduced 29,30 . In addition, according to distance metric learning theory, samples are mapped from the original space to the feature space so that the performance obtained in the feature space is better than that in the original space 31 . Obtaining a suitable space is essentially determining the proper distance metric. Therefore, the t class kernel function with multiscale form is constructed. To obtain a suitable distance metric, the t kernel is generalized to the p-norm t kernel.
In this study, a weighted p-norm distance t kernel (WpNDtK) SVM classification algorithm based on improved polarization is proposed for solving basic kernel construction and weight coefficient computation in a multiple kernel model. The main contribution of this paper is as follows. We construct a t-class kernel and provide a theoretical proof. To map the sample to a more suitable feature space, we generalize the t-class kernel as a weighted p-norm t-class kernel and give its properties. We define the affinity matrix and build an objective function of weight coefficients and kernel parameters according to local kernel polarization. The objective function is solved by the local gradient and the generalized Lagrange multiplier algorithm. The cumulative optimal performance rate is constructed to measure the overall classification performance of SVM algorithms with different kernels. The significance of the p-norm distance on SVM classification performance is verified based on the paired data t test with 10 times fivefold cross-validation. Through a large number of experiments on 6 real datasets, the results show that SVM classification prediction can appropriately improve performance when using WpNDtK compared with the traditional kernel function.
This paper is organized as follows. In "Introduction" section, we introduce the development and application of the kernel method and the optimal solution of weight coefficients in multiple kernel learning. The basic SVM model with multiple kernels is introduced in "Kernel support vector machine" section. In "t Class kernel and its generalization" section, we describe the construction of a weighted p-norm distance t kernel and provide a theoretical proof. In "Establishment and solution of the multiple kernel model" section, we describe the construction of the optimal model of weight coefficients and kernel parameters. The flow of the weighted p-norm t kernel SVM classification algorithm is shown in "Weighted p-norm t kernel SVM classification algorithm" section. Our experimental studies and an evaluation of the performance of the proposed WpNDtK SVM algorithm are presented in "Experimental results and analysis" section. The paper is concluded in "Conclusions" section, where suggestions for future work are also provided.

Kernel support vector machine
A support vector machine is a classification algorithm for binary classification problems and is based on the theory of structural risk minimization. SVM can also be extended to multiclass classification learning problems. The basic SVM model is a maximum-margin linear classifier defined in the feature space. By introducing the kernel function, SVM essentially becomes a nonlinear classifier. The basic principle of kernel SVM is given as follows.
Given the training dataset $T = \{(x_i, y_i) \mid x_i \in \mathbb{R}^d,\ y_i \in \{+1, -1\},\ i = 1, 2, \dots, n\}$, where $x_i$ is the d-dimensional input vector and $y_i$ is its class label, SVM can be formalized as the following convex quadratic programming problem:
$$\min_{\omega, b, \xi}\ \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i\left(\omega^{T}\varphi(x_i) + b\right) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, 2, \dots, n \quad (1)$$
where ω indicates the normal vector of the classification hyperplane, C is a predefined positive trade-off parameter between model simplicity and classification error, $\xi_i$ is the slack variable of the i-th sample, φ(x) is the feature vector mapped from x, and b is the bias term of the separating hyperplane. The goal of SVM is to maximize the margin $2/\|\omega\|$.
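The dual formulation of Model (1) is generally used when solving SVM:
$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \kappa(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C,\ \ i = 1, 2, \dots, n \quad (2)$$
where $\kappa(x_i, x_j) = \varphi(x_i)^{T}\varphi(x_j)$ is the kernel function and $\alpha_i$ is the Lagrangian multiplier. The bias term b can be solved from the support vectors in the training dataset. Its specific form is as follows:
$$b = \frac{1}{n_s}\sum_{s}\Big(y_s - \sum_{i}\alpha_i y_i \kappa(x_i, x_s)\Big) \quad (3)$$
where $x_s$ is a support vector and $n_s$ is the number of support vectors. The final SVM classifier is
$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{n}\alpha_i y_i \kappa(x_i, x) + b\Big) \quad (4)$$
For kernel SVM, the selection of the kernel function is the key to the classification performance of SVM. If the kernel function is not properly selected, the samples are mapped to an inappropriate space, which leads to a poor classification effect. To improve the performance, it is necessary to keep exploring new kernel functions. Since different kernels are applicable to different areas, the most straightforward idea is to combine several different kernels to integrate their advantages.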
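The step from a dual solution to a classifier can be illustrated with a minimal sketch. It assumes the standard soft-margin expressions for the bias (an average over support vectors) and for the sign classifier; the toy data, the multipliers `alpha` and the helper names are hypothetical and chosen only so that the result can be checked by hand.

```python
def linear_kernel(x, z):
    """Plain inner product, used here only as a simple example kernel."""
    return sum(a * b for a, b in zip(x, z))


def bias_from_support_vectors(X, y, alpha, kernel, tol=1e-12):
    """Average y_s - sum_i alpha_i y_i k(x_i, x_s) over the support vectors.

    Assumption: every sample with alpha_i > tol is treated as a support
    vector on the margin, which holds for the hard-margin toy example below.
    """
    sv = [s for s, a in enumerate(alpha) if a > tol]
    total = 0.0
    for s in sv:
        f_s = sum(alpha[i] * y[i] * kernel(X[i], X[s]) for i in sv)
        total += y[s] - f_s
    return total / len(sv)


def predict(x, X, y, alpha, b, kernel):
    """Sign of the kernel expansion plus bias."""
    score = sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(X))) + b
    return 1 if score >= 0 else -1


# Hypothetical 1-D toy problem: the maximum-margin solution is w = 1, b = 0,
# which corresponds to alpha = (0.5, 0.5).
X = [(-1.0,), (1.0,)]
y = [-1, 1]
alpha = [0.5, 0.5]
b = bias_from_support_vectors(X, y, alpha, linear_kernel)
label_right = predict((2.0,), X, y, alpha, b, linear_kernel)
label_left = predict((-3.0,), X, y, alpha, b, linear_kernel)
```

Swapping `linear_kernel` for any other valid kernel function leaves the two helpers unchanged, which is the practical appeal of the dual form.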
The simplest and most common way to construct a multiple kernel model is to directly combine some single kernels into a convex combination, and the basic form of this concept is as follows:
$$\kappa(x, y) = \sum_{i=1}^{M}\omega_i \kappa_i(x, y) \quad (5)$$
where $\kappa_i(x, y)$ is the basic kernel function, $\omega_i \ge 0$ is the kernel weight and $\sum_{i=1}^{M}\omega_i = 1$. We can combine existing kernels or construct new classes of kernels. To determine the kernel weights, a heuristic algorithm or an optimization model can be used. The optimization model is used in this work to solve for $\omega_i$. Section "The optimization of kernel weight and kernel parameter" provides more details.
According to Model (2), the dual formulation of SVM with multiple kernels is as follows:
$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \sum_{m=1}^{M}\omega_m \kappa_m(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C,\ \ i = 1, 2, \dots, n \quad (6)$$
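A convex combination of basic kernels can be sketched in a few lines. The example below is illustrative only: the Gaussian and polynomial kernels stand in for arbitrary basic kernels, and the weights 0.3 and 0.7, the sample points and all function names are hypothetical choices rather than values from the paper.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel with bandwidth sigma."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))


def polynomial_kernel(x, z, degree=2):
    """Inhomogeneous polynomial kernel (x.z + 1)^degree."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** degree


def combine_kernels(kernels, weights):
    """Return the convex combination sum_i w_i * k_i as a new kernel function."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")

    def combined(x, z):
        return sum(w * k(x, z) for w, k in zip(weights, kernels))

    return combined


def gram_matrix(kernel, X):
    """Kernel (Gram) matrix of a list of samples."""
    return [[kernel(a, b) for b in X] for a in X]


X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
kappa = combine_kernels([gaussian_kernel, polynomial_kernel], [0.3, 0.7])
K = gram_matrix(kappa, X)
```

Because nonnegative weights preserve positive semidefiniteness, the resulting Gram matrix can be passed to any SVM solver that accepts a precomputed kernel.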

t Class kernel and its generalization
In many practical tasks, samples are often linearly inseparable. Therefore, it is necessary to select an appropriate kernel function to map the samples to an appropriate feature space so that the samples are linearly separable in the feature space. If the kernel function is not properly selected, the samples cannot be linearly separated in the feature space, resulting in poor SVM classification performance. Therefore, kernel functions directly determine the performance of SVM classification. This encourages us to construct new types of kernel functions to adapt to different fields. Inspired by the t distribution probability density function, a t class kernel function is constructed.
For this kernel to have better flexibility and applicability, it is extended to the p-norm distance t kernel, and a reasonable distance measurement can be obtained by adjusting the norm.
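p-norm t kernel. Theorem 1 [32] Suppose that f : X → R is a bounded continuous integrable function.
Theorem 2 When n → +∞ , the t distribution probability density function
$$f(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}\left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}$$
is the kernel function, where Γ(·) is the gamma function.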
where e − |x| 2 is the Laplacian kernel function. According to Theorem 1, Therefore, When n → ∞ , the function is the kernel function.
Theorem 2 shows that when the sample size is sufficiently large, the probability density function of the t distribution can be used as the kernel function. The value of n that should be taken is often determined by experimental analysis. For the convenience of kernel function application, Corollary 1 is given as follows.

Corollary 1 When n=1 , Eq. (6) is equivalent to
Then, Eq. (10) is the kernel function. By generalizing the kernel function in Corollary 1, Corollary 2 is obtained as follows.
The kernel parameter in Corollary 2 ranges from 0 to 1. We can consider expanding the range of v to increase the applicability of the kernel function.
According to the complete monotonicity of the function, Corollary 3 expands the range of kernel parameters on the basis of Corollary 2, which provides more choices for us to use the kernel function.
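Corollary 1 can be checked by hand under one assumption, namely that the density referred to in Theorem 2 is the standard Student t density with n degrees of freedom. Substituting n = 1 and using Γ(1) = 1 and Γ(1/2) = √π gives the Cauchy form:

$$f(x)\Big|_{n=1} = \frac{\Gamma(1)}{\sqrt{\pi}\,\Gamma\left(\frac{1}{2}\right)}\left(1 + x^2\right)^{-1} = \frac{1}{\pi\left(1 + x^2\right)}$$

Replacing x by the distance between two samples would then yield a kernel proportional to $1/(1 + \|x - y\|^2)$. The constant factor 1/π does not affect positive definiteness, so it can be dropped when the expression is used as a kernel.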
In practical applications, Eq. (12) is in the following form: where the number 2 indicates the 2-norm. To find an appropriate distance measure in the mapped feature space, the Euclidean distance in Eq. (10) is generalized to the p-norm distance, and we can obtain where p indicates the p-norm. Equation (14) is called the p-norm distance t class kernel, or the p-norm t kernel for short.
The properties of the kernel function. Since the kernel function constructed in "p-norm t kernel" section is eventually extended to the form of Eq. (14), the corresponding properties are given in this section. We also discuss whether this kernel function is reasonable.
Property 1 The kernel function in Eq. (14) is a decreasing function of x, and $d_{ij} = \|x_i - x_j\|_p$ is the p-norm distance between any two samples. According to Property 1, the closer two samples are, the larger the kernel value is, and vice versa. When x = 0, the kernel function is at its maximum value. This shows that the kernel function can describe the similarity between samples well. The larger the kernel value is, the higher the similarity between samples. Property 2 is illustrated by function graphs, which are drawn by fixing c = 1 and v = 1, as shown in Fig. 1. When the scale parameters c and v are small, the kernel function can adapt to samples with drastic changes, and when the scale parameters are large, the kernel function can adapt to samples with gentle changes 34 , so that it has better adaptability in processing complex data. Similar to the Gaussian kernel function, the constructed kernel function in Eq. (14) is also a typical multiscale kernel.
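The behaviour described by Property 1 can be illustrated with a small sketch. The p-norm distance below is the standard definition; the kernel expression, however, is only an assumed illustrative form, (1 + c·d²)^(−v), chosen because it reduces to the Cauchy-type expression for c = v = 1 and p = 2. The exact form of Eq. (14) is not reproduced here, so `t_kernel` and its parameters should be read as hypothetical.

```python
def p_norm_distance(x, z, p=2.0):
    """Standard p-norm (Minkowski) distance between two vectors, p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1.0 / p)


def t_kernel(x, z, p=2.0, c=1.0, v=1.0):
    """ASSUMED t-type kernel (1 + c * d_p(x, z)^2)^(-v); not taken from Eq. (14)."""
    d = p_norm_distance(x, z, p)
    return (1.0 + c * d ** 2) ** (-v)


origin = (0.0, 0.0)
d1 = p_norm_distance(origin, (3.0, 4.0), p=1.0)   # Manhattan distance
d2 = p_norm_distance(origin, (3.0, 4.0), p=2.0)   # Euclidean distance

# Kernel values at increasing distances from the origin (1-D samples).
values = [t_kernel((0.0,), (d,)) for d in (0.0, 1.0, 2.0, 4.0)]
```

Whatever the exact expression, the two qualitative features stated in the text are visible here: the value is largest for identical samples and decreases as the p-norm distance grows, while changing p changes which samples count as close.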

Establishment and solution of the multiple kernel model
Weighted kernel function. Because different kernel functions have different characteristics, their performance will be significantly different for different types of datasets. To make the kernel function more flexible in application, the multiple kernel learning model is formed by kernel combination. Using multiple kernels instead of a single kernel can enhance the interpretability of the decision function and result in better performance than a single kernel 35 .
When the p-norm t kernel constructed in "p-norm t kernel" section is combined, we can obtain the combination kernel as follows.
Under the framework of a multiple kernel learning model, the representation of original samples in feature space is transformed into basic kernel selection and the calculation of weight coefficients. Each basic kernel corresponds to a basic feature space, and the question is how to fuse these basic feature spaces to obtain a suitable combined feature space, that is, one in which the data are better represented so that the classification prediction performance improves. Obtaining the combined feature space is essentially a problem of optimal calculation of weight coefficients. Currently, there are two main methods to calculate the weight coefficients: a heuristic algorithm and an optimization algorithm. In this work, an optimization method that has a more rigorous theory is adopted to solve for the weight coefficients. The key step in establishing the optimization model is to give the objective function. In this study, the objective function is established based on kernel target alignment, and the optimal solution should maximize the target value. Kernel target alignment relies only on training samples and is unrelated to subsequent classifiers, so the implementation of this strategy is simple and has attracted a large amount of attention. Since the kernel function contains hyperparameters, the values of the kernel parameters also have a significant impact on the classification prediction results. Therefore, how to select appropriate hyperparameters is also a key consideration. A direct approach is to put the kernel parameters and the weights together into the objective function for optimization.
Kernel target alignment. Kernel target alignment is a parameter optimization criterion established based on matrix alignment. This type of method only relies on training samples and is unrelated to the learning performance of subsequent classifiers. Therefore, the algorithm is simple and quick to implement, and its basic principle is as follows.
Given the training dataset $D = \{x_1, x_2, \dots, x_n\}$ and class labels $y = \{y_1, y_2, \dots, y_n\}$, where $y_i \in \{1, 2, \dots, k\}$ indicates that the dataset has k classes, $K = (\kappa(x_i, x_j))_{n \times n}$ is the kernel matrix. Then, $Y = yy^{T} = (y_{ij})_{n \times n}$ is the class label matrix and is also called the ideal kernel matrix.
The goal of the kernel target alignment is to maximize the cosine value between the kernel matrix and the ideal kernel matrix, and its expression is as follows:
$$A(K, Y) = \frac{\langle K, Y\rangle_F}{\|K\|_F\,\|Y\|_F}$$
where $\langle\cdot,\cdot\rangle_F$ is the Frobenius inner product and $\|\cdot\|_F$ is the Frobenius norm. Reference 36 proves the reliability and practicability of the kernel target alignment and the boundedness of the generalization error of the kernel classifier. On the basis of kernel target alignment, Baram proposed kernel polarization inspired by physics 37 . It is defined as the Frobenius inner product
$$P(K) = \langle K, Y\rangle_F = \sum_{i=1}^{n}\sum_{j=1}^{n} y_{ij}\,\kappa(x_i, x_j)$$
P(K) only takes between-class separability into account but neglects the preservation of within-class local structures; therefore, Wang proposed local kernel polarization (LKP) 38 , which is defined in terms of an affinity coefficient A ij .
The affinity coefficient is defined as where t > 0 is the adjusting parameter. From Eq. (19), the affinity coefficient A ij is defined by the Gaussian kernel function. There may be other, more appropriate ways of defining the affinity coefficient. Therefore, we redefine the affinity coefficient in "The optimization of kernel weight and kernel parameter" section to obtain better results.
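The three criteria discussed in this subsection can be compared on a toy example. The sketch below assumes binary labels in {+1, −1} so that the ideal kernel is simply yyᵀ. Alignment and polarization follow their standard definitions; `local_polarization` assumes that LKP weights each polarization term by an affinity coefficient, and it takes the affinity matrix as an input because the affinity formula itself is not reproduced here. All names are hypothetical.

```python
import math


def frobenius_inner(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def ideal_kernel(y):
    """Ideal kernel y y^T for labels in {+1, -1}."""
    return [[yi * yj for yj in y] for yi in y]


def alignment(K, y):
    """Cosine between the kernel matrix and the ideal kernel."""
    Y = ideal_kernel(y)
    return frobenius_inner(K, Y) / math.sqrt(
        frobenius_inner(K, K) * frobenius_inner(Y, Y)
    )


def polarization(K, y):
    """Kernel polarization <K, y y^T>_F."""
    return frobenius_inner(K, ideal_kernel(y))


def local_polarization(K, y, A):
    """ASSUMED LKP form: affinity-weighted sum of y_i y_j K_ij."""
    n = len(y)
    return sum(A[i][j] * y[i] * y[j] * K[i][j] for i in range(n) for j in range(n))


y = [1, 1, -1]
K_ideal = ideal_kernel(y)                       # perfectly aligned kernel
K_identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
ones = [[1.0] * 3 for _ in range(3)]            # uniform affinity

a_ideal = alignment(K_ideal, y)
a_identity = alignment(K_identity, y)
p_ideal = polarization(K_ideal, y)
lp_uniform = local_polarization(K_ideal, y, ones)
```

With a uniform affinity matrix the local criterion collapses to ordinary polarization, which makes explicit that the affinity coefficients are the only place where local structure enters.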
The optimization of kernel weight and kernel parameter. Based on the basic idea of the LKP, an improved local kernel polarization model is constructed to obtain the optimal kernel weights and kernel parameters. The improved part is reflected in the redefinition of the affinity coefficient in the LKP. The specific optimization model is as follows.
By redefining the affinity coefficient, we obtain Eq. (21). To solve Model (20), an optimization algorithm combining the local gradient and generalized Lagrange multiplier is adopted 39 . The gradient form of the model is as follows: To facilitate calculation, the parameters in Eq. (20) can be specified in advance, and for convenience c = 1. Equation (20) only contains the weighted p-norm t kernel. However, depending on the application field, the p-norm t kernel can also be combined with other types of kernel functions to obtain better classification performance.

Weighted p-norm t kernel SVM classification algorithm
According to the construction principle of the p-norm t-kernel and the establishment and solving process of the multiple kernel model, the basic flow of the weighted p-norm t kernel SVM classification algorithm is as follows.
Step 1: The dataset is divided into a training set and a test set by stratified k-fold cross-validation sampling.
Step 2: A specific kernel function is selected according to Eq. (5).
Step 3: The affinity coefficient matrix is built according to Eq. (21).
Step 4: According to Eq. (18), the objective function of kernel weight and kernel parameter is established.
Step 5: Based on the training set, the local gradient and generalized Lagrange multiplier 39 are used to solve Model (20) and obtain the optimal weight coefficients ω i and kernel parameters v, γ , d.
Step 6: The optimal parameters obtained in Step 5 are substituted into Eq. (5).
Step 7: Eq. (5), which is obtained in Step 6, is substituted into Model (6) to obtain the specific dual formulation of the multiple kernel SVM.
Step 8: The training set Train obtained by stratified sampling is used to fit Model (6).
Step 9: The test set is put into the fitted Model (6) to obtain the predicted class label ŷ i .
In Step 1, stratified sampling is used to prevent class imbalance in the training set and to prevent the fitted SVM classification model from being biased toward one class. The specific form of each single kernel function must be specified in Step 2. In this study, the p-norm t-kernel constructed in "p-norm t kernel" section is mainly used for weighted combination. According to the experimental analysis in Step 6, to make use of the unique advantages of different kernel functions, the p-norm t kernel can also be combined with traditional kernel functions, including the Gaussian kernel and polynomial kernel. Steps 3 to 5 belong to the optimization process of model parameters, including the solution of weight coefficients and kernel parameters. The objective function is established according to the local kernel polarization, and the local gradient and the generalized Lagrange multiplier are used to solve it. Other optimization algorithms can also be adopted. For details, please refer to reference 40 . Steps 6 to 8 fit the multiple kernel SVM model, and Step 9 predicts the test samples based on the fitted model. Finally, a specific evaluation index is used to evaluate the weighted p-norm t kernel SVM classification algorithm.
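Step 1 of the flow above can be sketched as follows. This is a minimal stdlib-only illustration of stratified k-fold splitting, not the authors' R implementation; the function name, the seed handling and the toy label vector are hypothetical.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs with class ratios preserved.

    Indices of each class are shuffled and dealt round-robin into k folds, so
    every fold receives roughly the same share of each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    folds = [[] for _ in range(k)]
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)

    splits = []
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        splits.append((train, test))
    return splits


# Hypothetical imbalanced label vector: 20 positives and 10 negatives.
labels = [1] * 20 + [-1] * 10
splits = stratified_k_fold(labels, k=5, seed=42)
```

Each of the five test folds here contains four positive and two negative samples, so the 2:1 class ratio of the full dataset is kept in every training and test set.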

Experimental results and analysis
Experimental setting. The experimental environment uses a Windows 10 64-bit operating system with an Intel i7-9700 @ 3.0 GHz CPU and 16 GB memory. The algorithm and experiments proposed in this paper are implemented in the R language (R 3.6.3). Experimental data are from the Broad Institute Genome Data Analysis Center and the UCI Machine Learning Repository. The specific information is shown in Table 1.
We compare the performance of the WpNt + SVM algorithm with the following methods: Poly + SVM, Sig + SVM, Gau + SVM, Lap + SVM and SMKL + SVM. To compare the effects of different kernel functions on the performance of the SVM classification algorithm, the experiment used fivefold cross-validation to divide the training set and test set, and the evaluation criteria were classification accuracy, recall, Kappa coefficient 44 and training time. The training time of the algorithm is related to the range of parameter settings, as it often takes more time to obtain results with good performance. Unlike the other three evaluation indices, the training time of the algorithm is discussed separately in "Comparison experiment" section. Due to the large sample size of the Postcode dataset, 10% random sampling is carried out in the training phase to reduce the time. Because of the high dimensionality of the Breast dataset, PCA is used to reduce its dimensionality in advance. To evaluate the overall performance level of the WpNt + SVM algorithm, the optimal performance rate is constructed as follows.

where MN is the number of algorithms, DN is the number of datasets, EN is the number of evaluation indices, and PN is the number of WpNt + SVM that reaches the maximum under each evaluation index.
Equation (22) is generalized to obtain the cumulative optimal performance rate (COPR). Its definition is as follows: where PN i is the number of algorithms reaching the i th maximum under each evaluation index, and m is the number of methods.
Comparison experiment. For different datasets, SVM classification based on different kernel functions yields different prediction effects. In experimental analysis, obtaining better classification and prediction performance requires flexibility; that is, depending on the dataset, multiple p-norm distance t kernels should be combined, or p-norm distance t kernels should be combined with traditional kernel functions. To reduce the complexity of the experiment, only two kernel functions are combined, and the positive trade-off parameter is set to C = 1 in all the SVM models. After many comparative experiments, different weighted kernel functions are selected for different datasets. The form of weighted kernel functions is mainly as follows.
Equation (24) is applied to the Kidney and Pima datasets, Eq. (25) is applied to the Postcode and Breast datasets, and Eq. (26) is applied to the Dermatology and Sonar datasets. When calculating the kernel weights and the kernel parameters, optimization Model (20) is adopted, and the aforementioned local gradient and generalized Lagrange multiplier method are used to solve the problem. The results are shown in Table 2. The range [a, b] of the p-norm value is set and the step size is given. For each p value, every performance index of the proposed method, including accuracy, recall and Kappa coefficient, is calculated by N repetitions of k-fold cross-validation. Finally, the p-norm value corresponding to the optimal performance index is determined.
For the WpNt + SVM algorithm, the objective function with the kernel weights and kernel parameters is established according to the improved local polarization. The local gradient and generalized Lagrange multiplier are adopted to obtain the optimal weights and parameters. For the other comparison algorithms, a grid search strategy and k-fold cross-validation are used to obtain the optimal parameters.
The different kernel SVM methods are denoted as Poly + SVM, Sig + SVM, Gau + SVM, Lap + SVM, SMKL + SVM and WpNt + SVM. These methods are used to perform fivefold cross-validation classification prediction for the 6 datasets shown in Table 1. The obtained comparative experimental results are shown in  Tables 3, 4 and 5, and the optimal results are bolded.
According to the experimental results in Tables 3, 4 and 5, the accuracy of the WpNt + SVM algorithm is optimal for 4 datasets and suboptimal for 1 dataset, the recall of the WpNt + SVM algorithm is optimal for 3 datasets and suboptimal for 1 dataset, and the Kappa coefficient of the WpNt + SVM algorithm is optimal for 3 datasets and suboptimal for 2 datasets. According to Eqs. (22) and (23), the optimal performance rate and cumulative optimal performance rate of WpNt + SVM are calculated as follows. In the 6 datasets analysed, WpNt + SVM is optimal in 10 cases and suboptimal in 4 cases, and the cumulative optimal performance rate is 0.7778, which is close to 80%. This shows that the p-norm t kernel constructed for this study can effectively improve the classification and prediction performance of the SVM algorithm. In addition, the combination of the p-norm t kernel with the classical Gaussian kernel and polynomial kernel is often better than the single kernel function. Therefore, multiple kernel learning methods can utilize the advantages of each single kernel effectively.
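The reported cumulative rate can be reproduced from the per-index counts given above. The sketch assumes that both rates divide by the number of dataset and index combinations (DN × EN = 6 × 3 = 18); this assumption is consistent with the reported value of 0.7778, but the formulas themselves are not reproduced in this text, and the function name is hypothetical.

```python
def cumulative_optimal_performance_rate(rank_counts, n_datasets, n_indices, m):
    """Share of (dataset, index) cells in which the method ranks within the top m.

    rank_counts[i] is the number of cells where the method achieves the
    (i + 1)-th best value. ASSUMPTION: the denominator is n_datasets * n_indices.
    """
    return sum(rank_counts[:m]) / (n_datasets * n_indices)


# Counts read from the text: accuracy, recall and Kappa coefficient.
optimal = 4 + 3 + 3       # cells where WpNt + SVM is best
suboptimal = 1 + 1 + 2    # cells where WpNt + SVM is second best

opr = cumulative_optimal_performance_rate([optimal, suboptimal], 6, 3, m=1)
copr = cumulative_optimal_performance_rate([optimal, suboptimal], 6, 3, m=2)
```

Under this reading, the optimal performance rate is 10/18 and the cumulative rate with m = 2 is 14/18, which rounds to the 0.7778 stated in the text.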
In classification prediction, the training time of the algorithm is also an important evaluation index. Since the final parameters of the comparison algorithm are determined by the wrapping strategy, the grid search strategy is used to set the range of hyperparameters in advance.
The specific setup is as follows. Polynomial kernel: d = 1 : 5 with a step size of 1; Gaussian kernel: σ = 0.01 : 4 with a step size of 0.01; Laplace kernel: σ = 0.01 : 1 with a step size of 0.05; Sigmoid kernel: β = 1 : 5 and θ = −10 : −1 with a step size of 1. The optimization model is used to solve for the kernel weights and parameters of the WpNt + SVM algorithm, so there is no need to set parameters in advance. See Table 6 for the specific training time (in minutes) of all algorithms.
According to Table 6, with the exception of Gau + SVM, the training time of WpNt + SVM is higher than that of the other comparison algorithms in most cases. It should be emphasized that for Poly + SVM, Sig + SVM, Gau + SVM and Lap + SVM, the training time depends on the setting range of the parameters. The optimization model is established to solve for the parameters of WpNt + SVM and SMKL + SVM based on the improved local polarization. Therefore, the algorithm proposed in this study does not depend on the setting range of the parameters. The hyperparameter in the Gaussian kernel has the smallest step size compared to the other single kernels. The training time of Gau + SVM is significantly higher than that of WpNt + SVM and SMKL + SVM in all datasets except the Pima dataset. This suggests that the training time of Poly + SVM, Sig + SVM, Gau + SVM and Lap + SVM can be expected to exceed that of WpNt + SVM and SMKL + SVM if the value range of the parameters is enlarged and the step size is continuously reduced. When dealing with large sample data, the GPU modules of R or Python can be called to train the model. WpNt + SVM can also be parallelized so that the training time is reduced.
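The dependence of grid-search cost on the parameter ranges can be made concrete by counting candidates. The sketch below enumerates the grids exactly as specified in the setup above; the dictionary keys and variable names are hypothetical, and the fit count assumes one model fit per candidate per fold in fivefold cross-validation.

```python
from itertools import product

# Candidate values built from integer steps to avoid floating-point drift.
poly_grid = [{"d": d} for d in range(1, 6)]                        # d = 1:5, step 1
gauss_grid = [{"sigma": k / 100} for k in range(1, 401)]           # 0.01:4, step 0.01
laplace_grid = [{"sigma": (1 + 5 * k) / 100} for k in range(20)]   # 0.01:1, step 0.05
sigmoid_grid = [
    {"beta": beta, "theta": theta}
    for beta, theta in product(range(1, 6), range(-10, 0))         # 1:5 and -10:-1
]

grid_sizes = {
    "Poly": len(poly_grid),
    "Gau": len(gauss_grid),
    "Lap": len(laplace_grid),
    "Sig": len(sigmoid_grid),
}

# Number of SVM fits under fivefold cross-validation (one fit per fold).
n_folds = 5
fits = {name: size * n_folds for name, size in grid_sizes.items()}
```

The Gaussian grid has 400 candidates against 5, 20 and 50 for the other kernels, which is in line with the observation that Gau + SVM has the longest training time among the grid-searched methods.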
Statistical measurement comparison test of p-norm distance. For the WpNt + SVM algorithm, different p-norm distances are set for different datasets because the experimental analysis showed that the p-norm distance affects the classification performance. The above analysis verifies the influence of different p-norm distances on SVM classification performance through a single set of cross-validation results, which often has strong randomness. We need to determine whether an observed significant or nonsignificant effect is systematic or due to chance, and statistical hypothesis testing provides an important theoretical basis for this 45,46 . Next, a paired t test is used to verify whether different p-norms have a significant impact on the classification performance of the SVM algorithm on the 6 datasets.
The specific operation steps are as follows.
(i) Given two different norm distances p_1 and p_2, we perform 10 times fivefold cross-validation under each norm distance. Two groups of classification evaluation indices of the SVM algorithm are obtained, including accuracy, recall and the Kappa coefficient, and are denoted x_i and y_i, i = 1, ..., n; (ii) The null hypothesis H_0 and the alternative hypothesis H_1 are stated; (iii) The mean and variance of the paired differences x_i − y_i are calculated, and the significance level is set to α = 0.05; (iv) The t value in Eq. (27) is calculated. If |t| > t_{1−α/2}(n − 1), then in a statistical sense, different p-norm distances have a significant effect on the SVM performance; otherwise, different p-norm distances do not have a significant effect on the SVM performance.
The null hypothesis and the alternative hypothesis in Step (ii) are equivalent to H_0: the use of different p-norm distances has no significant effect on the classification performance of SVM, and H_1: the use of different p-norm distances has a significant effect on the classification performance of SVM. With n = 10 and α = 0.05, the critical value in Step (iv) is t_{0.975}(9) = 2.262.
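The test procedure can be sketched in Python as follows. This is a minimal sketch, assuming that Eq. (27) is the standard paired t statistic t = sqrt(n) · mean(d) / s_d, where d_i = x_i − y_i and s_d is the sample standard deviation of the differences. The function names and the accuracy values are hypothetical and are not results from this study.

```python
import math
import statistics

T_CRIT = 2.262  # t_{0.975}(9), the critical value quoted in the text for n = 10


def paired_t_statistic(x, y):
    """Paired t statistic sqrt(n) * mean(d) / sd(d), with d_i = x_i - y_i."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    d = [xi - yi for xi, yi in zip(x, y)]
    sd = statistics.stdev(d)  # sample standard deviation, n - 1 denominator
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return math.sqrt(len(d)) * statistics.mean(d) / sd


def is_significant(x, y, t_crit=T_CRIT):
    """Decision rule of Step (iv): a significant effect is declared if |t| > t_crit."""
    return abs(paired_t_statistic(x, y)) > t_crit


# Hypothetical accuracies from 10 repetitions of fivefold cross-validation
# under two p-norm distances (illustrative numbers only).
acc_p1 = [0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.93, 0.91, 0.92, 0.94]
acc_p2 = [0.88, 0.89, 0.90, 0.90, 0.87, 0.89, 0.91, 0.88, 0.89, 0.90]
print(paired_t_statistic(acc_p1, acc_p2), is_significant(acc_p1, acc_p2))
```

The default critical value only applies to n = 10 at α = 0.05; for another number of repetitions the corresponding quantile of the t distribution with n − 1 degrees of freedom has to be passed as t_crit.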
To compare whether different p-norm distances have significant effects on the performance of the proposed algorithm, the p-norm levels are set according to the following principle: (i) Let p ∈ [a, b] with a fixed step size; (ii) The algorithm performance MI_i, i = 1, 2, ..., s, is calculated for each norm p_i; (iii) When MI_i − MI_j ≥ ε, 1 ≤ i, j ≤ s, the corresponding p_i and p_j are fixed. For convenience, let a = 1, b = 10, ε = 0.1, and let the step size be 0.5. If no pair satisfying MI_i − MI_j ≥ ε exists in [a, b], ε is reduced appropriately. For the 6 datasets in the experiment, two levels of the p-norm distance are set, and the specific information is shown in Table 7.
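The level-setting principle can be illustrated with a small sketch. The text does not say which pair is kept when several pairs exceed ε, nor how ε is reduced, so the code assumes that the pair with the largest performance gap is chosen and that ε is halved until a pair qualifies. The function names, the shrink factor and the scores are hypothetical.

```python
def p_grid(a=1.0, b=10.0, step=0.5):
    """Candidate norms a, a + step, ..., b."""
    n = int(round((b - a) / step))
    return [a + k * step for k in range(n + 1)]


def select_p_levels(scores, eps=0.1, shrink=0.5, min_eps=1e-4):
    """Pick two norms (p_i, p_j) with MI_i - MI_j >= eps.

    scores maps each norm p to its performance MI. Returns (p_i, p_j, eps_used),
    or None if no pair qualifies before eps falls below min_eps.
    """
    p_hi = max(scores, key=scores.get)  # best-performing norm
    p_lo = min(scores, key=scores.get)  # worst-performing norm
    gap = scores[p_hi] - scores[p_lo]   # largest gap over all pairs
    while eps >= min_eps:
        if gap >= eps:
            return p_hi, p_lo, eps
        eps *= shrink  # assumed reduction rule: halve eps
    return None


# Hypothetical performance values for three norms (illustrative only).
example_scores = {1.0: 0.80, 2.0: 0.93, 3.0: 0.85}
print(len(p_grid()), select_p_levels(example_scores))
```

In practice the scores would come from running the classifier once per norm in p_grid() and recording the chosen evaluation index.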
According to the above steps, the test statistic is calculated and compared to the critical value. The test results are shown in Table 8.
For the Kidney, Sonar and Pima datasets, the test results in Table 8 show a significant difference in accuracy, recall and Kappa coefficient between the two p-norm levels. For the other three datasets, there is no significant difference in accuracy, recall or Kappa coefficient at different p-norm levels, which is basically consistent with the results shown in Figs. 2, 3, and 4. In summary, a change in the p-norm distance influences the classification performance of SVM differently for different datasets. In some datasets, such as the Sonar, Pima and Kidney datasets, the influence of the change in the p-norm distance is significant; in other datasets, such as the Postcode, Dermatology and Breast datasets, the influence is minimal. Therefore, when a kernel function is defined through a p-norm distance, such as the p-norm t kernel constructed in this paper and the traditional Gaussian kernel, we need to consider the influence of the norm distance on the performance of SVM and obtain the appropriate norm distance through experimental analysis to achieve the best classification prediction effect of SVM.

Conclusions
For the classical SVM algorithm, the kernel function plays a crucial role in the classification prediction process because an appropriate kernel function can map samples to an appropriate feature space so that similar samples are close together and different samples are far apart. In view of this characteristic of the SVM algorithm, the p-norm distance t kernel is constructed according to the t probability density function, and a strict theoretical proof is given. To make use of the advantages of different types of kernel functions, the kernel functions are combined. The affinity matrix is redefined according to the local kernel polarization, and then an optimization model is established to solve the weight coefficients and kernel parameters. The weighted p-norm t kernel is applied to the SVM classification. Experimental analysis on six datasets shows that the proposed weighted p-norm t kernel can effectively improve the classification prediction performance of the SVM algorithm compared with the traditional single kernel function. Finally, the influence of the p-norm distance on the performance of the SVM algorithm is analysed based on a statistical comparison test. It is concluded that for different datasets, different norm distances will have different effects on the performance of the algorithm, some of which are significant and some of which are minimal.
The multiple kernel method based on improved local polarization in this paper is applied to SVM classification. The method is also suitable for dimensionality reduction, kernel clustering and medical drug screening, and in future work it will be improved and generalized in these research directions. However, the method proposed in this paper is only a simple linear combination of multiple kernel functions. There is no complete and effective theoretical basis for the selection and combination of kernel functions, and the optimization of kernel weights and kernel parameters still faces the problem of nonconvergence, which remains to be solved.